Drupal 8: Import an Atom feed using Migrate

I needed to get events from another system into Drupal 8. They were available through an Atom feed. At the time of writing, the Feeds module was not ready for Drupal 8.
Another way of importing (external) sources is through the core module Migrate. It has no UI, however, so it is not a point-and-click operation.
I'd like to share with you how to write a (simple) migration plugin. Bear in mind, though, that each migration is very specific and has its own particular needs.

So what do we need?

This blog post pointed me in the right direction.
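Besides core Migrate we need a couple of contrib modules to get what we want:
  • Migrate Plus (provides the url source plugin and the xml data parser used below)
  • Migrate Tools (provides the drush migrate commands used below)

Later I found that using Migrate File (extended) makes dealing with images really easy!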

The source file

We need to know exactly what sort of file we're dealing with: what is in it, and how do we get it out?
This is (a part of) the source XML.
I've removed some parts to keep it small and left only one entry; the other entries all have the same structure.

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:gd="http://schemas.google.com/g/2005" xmlns:pro="http://schemas.example.com/2011">
    <id>tag:example.com,2011-05-01:https://example.example.com/web/feeds/events</id>
    <updated>2018-05-14T17:56:33+02:00</updated>
    <category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/g/2005#event" />
    <title type="text">Example event feed</title>
    <subtitle type="text">Events for example</subtitle>
    <generator version="1.0" uri="https://example.example.com/">example</generator>
    <author>
        <name>example</name>
        <email>tickets@example.org</email>
    </author>
    <entry>
        <pro:type value="event" />
        <pro:id value="33233" />
        <pro:eventGroupId value="213066" />
        <category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/contact/2008#contact" />
        <title>House of cards</title>
        <pro:subtitle>Lorem Ipsum</pro:subtitle>
        <link rel="self" type="application/atom+xml" href="https://example.example.com/web/feeds/events/43234455" />
        <link rel="alternate" type="text/html" href="https://example.example.com/events/33233" />
        <link rel="alternate" type="text/calendar" href="https://example.example.com/web/events/233223.ics" />
        <link rel="related" type="text/html" href="https://example.example.com/events/listday?month=5&amp;year=2018&amp;day=04" />
        <id>tag:example.com,2011-05-01:https://example.example.com/web/feeds/events/2343443</id>
        <published>2018-03-20T17:17:16+01:00</published>
        <updated>2018-05-03T20:03:55+02:00</updated>
        <category scheme="http://schemas.example.com/2011#eventType" term="dancenight" label="Dancenight" />
        <pro:tags>rock, pop</pro:tags>
        <gd:eventStatus value="http://schemas.google.com/g/2005#event.confirmed" />
        <pro:spaces>
            <pro:space>SUB</pro:space>
        </pro:spaces>
        <gd:where rel="http://schemas.google.com/g/2005#event" label="example">
            <gd:entryLink>
                <entry>
                    <title>example</title>
                    <category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/contact/2008#contact" />
                    <gd:structuredPostalAddress primary="true">
                        <gd:street>ladida 39</gd:street>
                        <gd:city>Sometown</gd:city>
                        <gd:postcode>0000 AA</gd:postcode>
                        <gd:formattedAddress>example, ladida 39, 0000 AA Sometown </gd:formattedAddress>
                    </gd:structuredPostalAddress>
                    <summary>example, ladida 39, 0000 AA Sometown  </summary>
                </entry>
            </gd:entryLink>
        </gd:where>
        <gd:when startTime="2018-05-04T22:00:00+02:00" endTime="2018-05-05T02:00:00+02:00" />
        <gd:extendedProperty name="http://schemas.example.com/2011#doorsOpen">
            <gd:when startTime="2018-05-04T22:00:00+02:00" />
        </gd:extendedProperty>
        <pro:tickets isSoldOut="false">
            <pro:total>0</pro:total>
            <pro:remaining>0</pro:remaining>
        </pro:tickets>
        <link rel="hyperlink" href="https://www.example.net/" title="poop poop de doop" featured="false" />
        <link rel="audiolink" href="http://hcmaslov.d-real.sci-nnov.ru/public/mp3/Queen/Queen%20'A%20Kind%20Of%20Magic'.mp3" featured="false" />
        <link rel="videolink" href="https://www.youtube.com/watch?v=dQw4w9WgXcQ" featured="false" />
        <link rel="image" href="https://example.example.com/images/7/eventpublicationitem/320078/Screen Shot 2018-05-01 at 13.18.21.png" featured="false" />
        <content type="html">&lt;p&gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer pulvinar nibh nec ante eleifend pulvinar. Nulla molestie vel justo ac faucibus. Duis consectetur eu ipsum at dictum. Suspendisse convallis hendrerit leo a molestie. Quisque sollicitudin felis velit, nec laoreet massa tincidunt et.&lt;/p&gt;&lt;p&gt;Aenean nec gravida mi, sodales hendrerit purus. Mauris gravida risus ipsum, sit amet porta tellus vehicula feugiat. Donec posuere fringilla sapien vel vestibulum. Nunc nec scelerisque ligula. Donec vitae tempus nulla, rhoncus egestas lacus. Nulla rutrum nec nulla ut ullamcorper. Ut consectetur blandit libero non eleifend. Duis ut rutrum sem. Sed dapibus lectus vel metus dictum euismod. Donec at purus vitae elit mattis consequat fringilla quis massa. Nam at velit sed lorem ultricies semper. Quisque viverra congue mi, at venenatis purus vehicula nec.&lt;/p&gt;</content>
    </entry>
</feed>
To keep things to the point, I will limit this description to the fields that need a specific way to get their value or to pass that value on to our migration plugin.
Since we are importing events, I made a content type 'event' with the following fields:
  • Title (default)
  • Body (basic_html)
  • Image (core image field)
  • Start Date (date field)
  • Address (plain text)
Looking at our source, we can match the XML nodes with their corresponding fields.
The title will be <title>, the body will be <content>, the image must be the 'href' attribute of <link rel="image">, the start date is on the <gd:when> tag, and the address is nested inside the <gd:where> tag, mhhh...
Also, each item in our migration needs a unique ID so that rollbacks and updates are possible. We'll use the value of <pro:id/> for that.

A basic migration file has at least a few things: an ID and a label, a source, a process, a destination and its dependencies.
It looks something like this:

id: events_importer        # name by which to call your plugin
label: 'Import Atom feed Example'
source:                    # your incoming items
process:                   # the processing & mapping of values
destination:               # the entity to save to
dependencies:              # other modules that are needed
The source describes the incoming things. In it we define the URL, the file type, the parser we want to use, the item selector, the 'fields' we've got on each item and also the namespaces of our XML. It took me quite some time to figure out how to work with those. I had not worked with XML in a while (it's all JSON now).

Xpath

The configuration for XML in migration.yml files works with XPath expressions. So in this example our 'title' is at /feed/entry[1]/title and the 'content' at /feed/entry[1]/content. If we test these paths in this online xpath-tester, we get their values. If we try to get the ID, however, using /feed/entry[1]/pro:id/@value, we get nothing.
This is because of the namespacing. The id tag is <pro:id value="33233"/>; 'pro' is its namespace prefix.

Namespaces

OK, so we have namespaces. How can we get the values of those fields? Our 'normal' xpaths return null.
This page explains namespaces and how they affect xpath.
We need to add those namespaces to the source declaration in our YAML file to be able to use them.
In a migration configuration this is simple. Our file now looks like this:

id: events_importer
label: 'Import Atom feed Example'

source:
  plugin: url                                         # the migrate plugin to use for fetching our file
  data_fetcher_plugin: http                           # the protocol to use to get the file
  data_parser_plugin: xml                             # the type of data to parse
  namespaces:                                         # declaration of namespaces found in our source file
    atom: 'http://www.w3.org/2005/Atom'
    gd: 'http://schemas.google.com/g/2005'
    pro: 'http://schemas.example.com/2011'
  urls: 'https://us.example.com/feeds/events'         # the url of our source
  item_selector: '/feed/entry'                        # the items

As you can see, I've added all the declared namespaces, including the Atom one. In our XML the namespaces are declared at the top of the file with the xmlns attributes:

<feed xmlns="http://www.w3.org/2005/Atom" xmlns:gd="http://schemas.google.com/g/2005" xmlns:pro="http://schemas.example.com/2011">
This determines our xpaths.
Another thing to notice is the item_selector, in this case '/feed/entry'. We have many 'entry' nodes (in the example there is only one). Setting this will make the script iterate over them all.
The fields that we will add to the source are the fields on these entries.
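
To see outside of Drupal why the prefix matters, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is not how Migrate resolves selectors internally; it only illustrates the XPath behaviour. ElementTree supports just a subset of XPath, so the attribute is read with .get() instead of /@value. The sample is a trimmed copy of the feed above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed above: one entry with a namespaced id and a title.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:pro="http://schemas.example.com/2011">
  <entry>
    <pro:id value="33233" />
    <title>House of cards</title>
  </entry>
</feed>"""

# Same prefix-to-URI pairs as under 'namespaces' in the migration file.
NAMESPACES = {
    "atom": "http://www.w3.org/2005/Atom",
    "pro": "http://schemas.example.com/2011",
}

feed = ET.fromstring(SAMPLE)

# Without a prefix the path asks for <entry> in no namespace, so nothing matches.
unprefixed = feed.find("entry")

# With the prefixes mapped to their URIs the elements are found.
entry = feed.find("atom:entry", NAMESPACES)
guid = entry.find("pro:id", NAMESPACES).get("value")
title = entry.find("atom:title", NAMESPACES).text

print(unprefixed, guid, title)
```

The prefix itself is arbitrary; what has to match is the URI it is mapped to.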

So now we can add the fields from which we want the values, as described above. The fields key is part of our source, so the file above continues like this

# Since we are iterating over the items, the /feed/entry[i]/ part of our xpath is 'given', so the fields are relative to that.
  fields:
    -
      name: guid
      label: 'GUID'
      selector: 'pro:id/@value'                # because we added the namespace declaration we can use the prefix like this
    -
      name: title
      label: 'Title'
      selector: 'atom:title'                    # since I also added atom: to the namespace declaration, our title path becomes this
    -
      name: body
      label: Body
      selector: 'atom:content'
    -
      name: from_date
      label: 'From Date'
      selector: 'gd:when/@startTime'
    -
      name: address
      label: 'Location Address'
      selector: 'gd:where/gd:entryLink/atom:entry/gd:structuredPostalAddress/gd:formattedAddress'   # this path is rather funky, but it works
    -
      name: image
      label: 'Event image'
      selector: 'atom:link[@rel="image"]/@href'   # this gets the image link by its rel attribute (we do not use link[8], because the position shifts when other links are omitted)
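
The combination of item_selector and relative field selectors can be mimicked in a few lines of Python to check the selectors against a sample of the feed. This is only a sketch, not what Migrate Plus does internally. The function name extract_rows and the shortened image URL are made up for this example. Because ElementTree only supports a subset of XPath, attributes are read with .get() rather than /@href.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed above; the image URL is shortened for readability.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005"
      xmlns:pro="http://schemas.example.com/2011">
  <entry>
    <pro:id value="33233" />
    <title>House of cards</title>
    <gd:where label="example">
      <gd:entryLink>
        <entry>
          <title>example</title>
          <gd:structuredPostalAddress primary="true">
            <gd:formattedAddress>example, ladida 39, 0000 AA Sometown </gd:formattedAddress>
          </gd:structuredPostalAddress>
        </entry>
      </gd:entryLink>
    </gd:where>
    <gd:when startTime="2018-05-04T22:00:00+02:00" endTime="2018-05-05T02:00:00+02:00" />
    <link rel="alternate" type="text/html" href="https://example.example.com/events/33233" />
    <link rel="image" href="https://example.example.com/images/event.png" />
  </entry>
</feed>"""

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "gd": "http://schemas.google.com/g/2005",
    "pro": "http://schemas.example.com/2011",
}


def extract_rows(xml_text):
    """Return one dict per top-level <entry>, keyed by the migration field names."""
    root = ET.fromstring(xml_text)
    rows = []
    # Counterpart of item_selector: iterate over the direct <entry> children.
    for entry in root.findall("atom:entry", NS):
        # Field selectors are relative to the current entry.
        address = entry.find(
            "gd:where/gd:entryLink/atom:entry/"
            "gd:structuredPostalAddress/gd:formattedAddress",
            NS,
        )
        # Select the link by its rel attribute, not by its position.
        image = entry.find("atom:link[@rel='image']", NS)
        rows.append({
            "guid": entry.find("pro:id", NS).get("value"),
            "title": entry.find("atom:title", NS).text,
            "from_date": entry.find("gd:when", NS).get("startTime"),
            "address": address.text.strip() if address is not None else None,
            "image": image.get("href") if image is not None else None,
        })
    return rows


rows = extract_rows(FEED)
print(rows)
```

Note that the nested <entry> inside <gd:where> is not picked up as an item, and that the title of the event is not confused with the title of the location, because each path is evaluated from the entry it belongs to.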

So this gives us all our values. We now need to 'convert' some of them to the right format, or pass them to an additional plugin to 'preprocess' them.
How this should be done can be described in the process 'array'. Like so:

process:
  uid:                                            # the user that created the item, it can be any existing user
    plugin: default_value                         # we use the default_value plugin to set the default
    default_value: 1                              # in this case user 1
  title: title
  'body/value': body                              # we split the body into its subparts 'value' and 'format' to set their values
  'body/format':
    plugin: default_value
    default_value: basic_html                     # make sure 'basic_html' exists
  status:                                         # the published value, we set it to TRUE
    plugin: default_value
    default_value: 1
  type:                                           # the content type
    plugin: default_value
    default_value: event
  field_from_date:                                # we use the format_date plugin to transform the source date to a Drupal format
    plugin: format_date
    from_format: 'Y-m-d\TH:i:sP'
    to_format: 'Y-m-d\TH:i:s'
    source: from_date
  field_location_address: address
  field_event_image:                              # the image is downloaded via the image_import subplugin of migrate_file
    plugin: image_import                          # it returns an image entity reference that we can set on our element
    source: image
    destination: constants/file_destination       # the constant is declared in the source 'array'; more on that below
    uid: '@uid'                                   # this references the uid as set above, mind the quotes!
    alt: !image                                   # the ! renders the value as a string
    skip_on_missing_source: true                  # this is useful so it does not fail when no image is present
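
What the format_date step does to the value can be checked outside Drupal. A minimal Python sketch, assuming Python 3.7 or later (where %z accepts an offset written as +02:00). The function name is made up for this example, and the strptime/strftime patterns are Python equivalents of the PHP formats 'Y-m-d\TH:i:sP' and 'Y-m-d\TH:i:s' used above.

```python
from datetime import datetime


def reformat_start_time(value: str) -> str:
    """Parse a timestamp like 2018-05-04T22:00:00+02:00 and drop the UTC offset."""
    # Equivalent of from_format 'Y-m-d\TH:i:sP' (date, 'T', time, offset).
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    # Equivalent of to_format 'Y-m-d\TH:i:s' (same value without the offset).
    return parsed.strftime("%Y-%m-%dT%H:%M:%S")


converted = reformat_start_time("2018-05-04T22:00:00+02:00")
print(converted)
```

Note that this sketch keeps the local wall-clock time and only drops the offset; it does not convert the time to UTC.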
The resulting configuration file (partially) is this:
I've left out some fields that were similar to the ones I already showed you.
id: events_importer
label: 'Import Atom feed Example'
status: true
source_type: 'XML files'
 
source:
  plugin: url
  data_fetcher_plugin: http
  data_parser_plugin: xml
  track_changes: true
  namespaces:
    atom: 'http://www.w3.org/2005/Atom'
    gd: 'http://schemas.google.com/g/2005'
    pro: 'http://schemas.example.com/2011'
  urls: 'https://namespace.example.com/feeds/events'
  item_selector: '/feed/entry'
  fields:
    -
      name: guid
      label: 'GUID'
      selector: 'pro:id/@value'
    -
      name: title
      label: 'Title'
      selector: 'atom:title'
    -
      name: body
      label: Body
      selector: 'atom:content'
    -
      name: from_date
      label: 'From Date'
      selector: 'gd:when/@startTime'
    -
      name: address
      label: 'Location Address'
      selector: 'gd:where/gd:entryLink/atom:entry/gd:structuredPostalAddress/gd:formattedAddress'
    -
      name: image
      label: 'Event image'
      selector: 'atom:link[@rel="image"]/@href'
 
  constants:
    file_destination: 'public://imports/'
  ids:
    guid:
      type: integer
 
destination:
  plugin: entity:node
  default_bundle: event
 
process:
  uid:
    plugin: default_value
    default_value: 1
  title: title
  'body/value': body
  'body/format':
    plugin: default_value
    default_value: basic_html
  status:
    plugin: default_value
    default_value: 1
  type:
    plugin: default_value
    default_value: event
  field_from_date:
    plugin: format_date
    from_format: 'Y-m-d\TH:i:sP'
    to_format: 'Y-m-d\TH:i:s'
    source: from_date
  field_location_address: address
  field_event_image:
    plugin: image_import
    source: image
    destination: constants/file_destination
    uid: '@uid'
    alt: !image
    skip_on_missing_source: true
 
 
migration_dependencies:
  required: {}
 

Running the import (finally!)

We are finally ready to actually run this script and import our items (in reality you'll probably be testing this many, many times before you get it right).
There are several ways to test your script. Perhaps the easiest is to import the yml file via the configuration synchronisation. Select 'import/single item', select 'migration' for the type, paste your YAML code into the box and click import. Then run:

drush migrate:import events_importer
A better way though would be to make it into a custom module, install it and then run the drush command. Here is how: ...... ......

Once you are sure your migration works you can leave it to cron to run it, with that same command.

Things that will go wrong:

Most likely you will get an error at some point during development. A few of the things that I found went wrong:

  • I needed two patches to get around some errors (this may be fixed by now).

    The first thing was:

    [error]  Migration failed with source plugin exception: Serialization of 'SimpleXMLElement' is not allowed.
    I used this patch.

    The other one apparently has to do with drush 9 not implementing a function:

    [error] Error: Call to undefined function Drupal\migrate_tools\Commands\drush_print_table() in Drupal\migrate_tools\Commands\MigrateToolsCommands->messages()
    I used this patch.

  • At the point at which I added the importing of images, my migration simply failed without displaying any errors. It just said:

    [notice] Processed 0 items (0 created, 0 updated, 0 failed, 0 ignored) - done with 'events_importer'
    If this happens you can always run

    drush migrate-messages events_importer
    to see if anything was logged.
    In my case, the permissions for the 'public://imports/' directory were wrong and it could not be created. Once that was fixed the import ran and imported all my items, including the images!
  • Another thing that might happen is that your migration fails due to an error and is therefore unable to reset its status back to 'idle'. The status then remains 'Importing' and you will not be able to run your migration again.
    You'll see an error like:

    [error] Migration events_importer is busy with another operation: Importing
    To see the status run:

    drush migrate:status
    You'll see a list of migrations with their status and when they were last executed.
    To reset the status of one of them (to 'idle'), run:

    drush migrate-reset-status events_importer   # <-- events_importer being the id of the one to reset
  • Another useful thing to know is how to remove your configuration files.
    Via

    drush php
    you get an interactive shell in which you can run PHP. The following snippet will remove the configuration from your system:

    Drupal::configFactory()->getEditable('migrate_plus.migration.events_importer')->delete();
    This will allow you to re-install your module.

    A better (or easier) way to deal with this is to make sure your MODULE.install does this for you while uninstalling the module, in which case you need to only uninstall and re-install the module.
    I used this snippet:

    <?php
     
    /**
     * @file
     * events_import install file.
     */
     
    /**
     * Implements hook_uninstall().
     */
    function events_import_uninstall() {
      // Delete this module's migrations.
      $migrations = [
        'events_importer',
      ];
      foreach ($migrations as $migration) {
        Drupal::configFactory()->getEditable('migrate_plus.migration.' . $migration)->delete();
      }
    }

Sources

I have used many different sources to find out how to make this work. They may be useful to you too.

Happy migrating!
